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1. INTRODUCTION 

We are grateful for the opportunity to comment 
on Professors Lee and Nelder's work. Like their other, 
related papers, they are replete with important ideas 
and thought-provoking theses. We will take up just 
a few of the topics touched upon and offer some 
reflections. In Section 2, we consider models with, 
and inferences for, unobservables. Section 3 revis- 
its one of the counterexamples discussed by the au- 
thors which turns out to have insightful connections 
with the Cauchy distribution. Connections between 
generalized estimating equations and fully specified 
joint distributions are touched upon in Section 4. 
Finally, the position of the authors' computational 
proposals among alternative routes is explored in 
Section 5. 

2. THE NATURE OF UNOBSERVABLES 

Although Lee and Nelder claim, in their Section 4.4, 
that unobservables are often verifiable because the 
unobservables are latent variables for observed data, 
we would like to issue a word of caution. Evidently, 
using the authors' notation, variance components A 
and <fi are identifiable from the data, whenever there 
is replication within the i levels (e.g., repeated mea- 
sures j on subject i, or litter mates j correspond- 
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ing to dam i). However, such data-based verifica- 
tion is confined, in the strict sense, to the way the 
induced marginal model is accurate. This model is of 
a compound-symmetry type: 
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Y i ~JV(l / 9,V r i = AJ^ + 0J ni ). 



In other words, one can assess from the data whether 
Vi is an accurate description of the variance-covariance 
structure, just as one can verify whether the implied 
constant mean is adequate. But whether this pro- 
vides an accurate description of the unobservables 
is another matter because there is a many-to-one 
map of hierarchical models to (1). 

To see this more clearly, we start from this com- 
pound-symmetry setting and take an example of 
Molenberghs and Verbeke (2010). Consider the fol- 
lowing random-intercepts model: 



(2) 



= x- •£ + bi + Si 



1, 



, m of 



,N, vector of known covari- 



where Y^ is the response for member j 
cluster i = l, 

ate values, £ is a vector of unknown regression coeffi- 
cients and bi ~ N(0, A 2 ) is a cluster-specific random 
effect assumed to be independently distributed from 
the residual error components e^- ~ iV(0,^ 2 ). The 
implied marginal model is obtained by integrating 

(2) over the random effects. Grouping the into a 
vector Yj and assembling the rows x^- into a matrix 
Xi, this marginal distribution is 

(3) Y~iV(X^,A 2 J ni +^%), 

in which I ni denotes the identity matrix of dimen- 
sion rij, and where J ni equals the n« x re, matrix 
containing only ones. Evidently, (1) is a particular 
instance of (3). 

Traditionally, there have been two views regarding 
the variance component A 2 in the above model. In 
the first, where the focus is entirely on the resulting 
marginal model (3), negative values for A 2 are per- 
fectly acceptable (Nelder, 1954; Verbeke and Molen- 
berghs, 2000, Section 5.6.2) because this merely cor- 
responds to the occurrence of negative within-cluster 
correlation p = A 2 /(A 2 + u 2 ). In such Si CcLSG, the only 
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requirement is that A 2 + v 2 > and V$ 



A 2 Jn, + 



v I ni is positive definite. In the second view, when 
the link between the marginal model (3) and its gen- 
erating hierarchical model (2) is preserved, thereby 
including the concept of random effects hi and per- 
haps even requiring inferences for them, it has been 
considered imperative to restrict A 2 to nonnegative 
values. 

In such a view, it is implicit that any hierarchical 
model, corresponding to the compound-symmetry 
model (3), should be of the form (2). But this is not 
the case. To see this, we first reiterate that the hi- 
erarchical model, corresponding to a given marginal 
model, is nonunique. This originates from the ran- 
dom effects' latency and is crucial to the theme of 
Lee and Nelder's current paper. 

To illustrate this nonuniqueness, consider the sim- 
ple but insightful case of two measurements per sub- 
ject, that is, rij = 2. We contrast two models, the 
first one of the form (2) with random intercepts 6j ~ 
iV(0,A 2 ) and heterogeneous errors Eij ~ N(0, v 2 ), 
(j = 1,2). The second takes the following form: 



(4) 



Yij = x- •£ + b i + b u (j - 1) + Si 



with two uncorrelated random effects, 



hi 



N 



A? 






\2 
A 2 



and homogeneous error s-ij ~ N(0, v 2 ). The marginal 
means are, obviously, equal. At the same time, the 
marginal variance-covariance matrix in the first 
model is 



(5) 



(A 2 )(l 1) + 







A 2 + ^ 2 
A 2 



A 2 + vl 



and the counterpart for the second model is 



(6) 
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A? 

A 2 + A 2 + z, s 



Evidently, V^> and are equivalent, through the 



linear relationships Aj = A 



X 2 

A 2 



/ 2 , and v 2 



u(. What this means, in this case, is that the ob- 
served heterogeneity in variance can be ascribed to 
either heterogeneous residual errors or to the pres- 
ence of a random slope. The fitted marginal model, 
and hence the data, cannot be used to distinguish 
between these two scenarios. Thus, one view is that 
fitting a marginal model comes with an entire equiv- 
alence class of hierarchical models that reduce to the 
given marginal model. 

We now show that another simple extension of the 
distributional assumptions of (2) allows for negative 
variance components, while maintaining the model's 
random- intercepts interpretation. 

We retain the random- intercepts model (2), but 
now with the assumption, 

/ h \ 
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The induced conditional distribution of the mea- 
surement error vector, given the random intercept, 
is 



(S) 



N 



Tb i. 1 , , 2 r 2 T > 



Here, j„ 4 is a ra^-vector of ones. Note that (7) al- 
lows for marginally uncorrelated measurement er- 
rors that become correlated, conditional upon the 
random effect. 
The corresponding marginal model is 

(9) Yi ~ N[X£, (d + 2r) J„. + a 2 I ni ]. 

Starting from a conventional compound-symmetry 
model, it is clear that a 2 = v 2 and d + 2r = A 2 . Evi- 
dently, d and r are not jointly identified, pointing to 
a collection of hierarchical models that all yield the 
same marginal model and hence are indistinguish- 
able based on the data alone. To define the range of 



Table 1 



a 2 d 2r 

var(r y ) = v 2 + (A 2 + 2v 2 + 2va^JX i + v 2 ) + (~2u 2 - 2va^\ 2 + v 2 ) 
cov{Y i3 ,Y ik ) = (A 2 + 2v 2 + 2uaVX 2 + f 2 ) + \-2v 2 - 2vay/\ 2 + v 2 ) 
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this collection, it is necessary for r 2 < da 2 . Together 
with r = (A 2 — d)/2, this leads to the set of allowable 
solutions, 

(10) d = A 2 + 2u 2 + 2va\j\ 2 + i/ 2 , 
with a£ [—1,1]. From (10), we find 

(11) t = -{v 2 + ua^\ 2 + v 2 ). 

In other words, we have a decomposition of the vari- 
ance and the covariance, as displayed in Table 1. 
Observe that, when A 2 is positive, r = is among 
the solutions; that is, this recovers the conventional 
random- intercepts model with uncorrelated errors. 
However, when A 2 < 0, then all values for r are nec- 
essarily negative. 

This construction reduces the conventional random- 
intercepts model (2) to a special case of the extended 
family of hierarchical models with correlation be- 
tween random effects and measurement errors. Once 
again, it is clear that one can make inferences about 
the random effects, given that one is prepared to 
make strong, and unverifiable assumptions about 
the hierarchical model, stemming from the many- 
to-one map from hierarchical models to the implied 
marginal models. In this sense, the above derivations 
underscore, not just that every compound-symmetry 
model can be induced by a hierarchical model, but 
that an entire collection of random-intercepts mod- 
els fulfills this role. These differ from each other by 
the degree to which the random intercepts and mea- 
surement errors are correlated. 

The above considerations focus on random effects. 
This is but one example of unobservables. Like Lee 
and Nelder, Verbeke and Molenberghs (2010) ar- 
gue that so-called augmented data, in the sense of 
supplementing the observed data with latent, unob- 
served structures, is common throughout statistics. 
Examples include models for incompletely observed 
data, describing observed and unobserved outcomes 
alike, random-effects models, latent class models, la- 
tent variable models, censored survival data, etc. 
Heitjan and Rubin (1991) and Zhang and Heitjan 
(2007) have unified some of these settings in a con- 
cept called coarsening, broadly referring to the fact 
that the observed data are coarser than the hypo- 
thetically conceived data structures while models 
target the latter. Generally, models for augmented 
structures are identifiable only by virtue of mak- 
ing sometimes strong but always partially unveri- 
fiable modeling assumptions. These settings taken 



together are termed enriched data by Verbeke and 
Molenberghs (2010). Of course, there is a subtle 
distinction between both concepts. In the coarse- 
data setting, it is understood that a part of the 
data would ideally be observed but are not in prac- 
tice (e.g., actual survival time after censoring, out- 
comes after dropout, etc.). Augmented data refers 
rather to the addition of useful but artificial con- 
structs to the data setting, such as random effects, 
latent classes and latent variables which are never 
directly observed. Such augmentations permit sim- 
ple model development and represent a very power- 
ful tool to succinctly accommodate posited, poten- 
tially very complex, often causal, real-world struc- 
tures, a fact of which Lee and Nelder also make use. 

Verbeke and Molenberghs (2010) show that every 
model for enriched-data settings can be factored as 
a product of two components: the first one, termed 
the marginal model, is fully identifiable from the 
observed data; the second one, the conditional dis- 
tribution of the enriched data given the observed 
data, is entirely arbitrary. The evident consequence 
is that the identification of such a part can come 
from assumptions only and points at the same time 
to the considerable risk for conclusions to be sen- 
sitive to such assumptions, and ultimately to the 
need for conducting sensitivity analyses. It implies 
that such non-identified parts can be replaced arbi- 
trarily, without altering the fit to the observed data 
but with potentially huge implications for inferences 
and substantive conclusions. Put simply, one's in- 
ferential conclusions may strongly depend on such 
unverifiable portions of the model. 

In the missing data case, studied in more detail 
by Molenberghs et al. (2008), one could identify the 
second factor by requiring, for example, that it is of 
the MAR type. This means that every model as- 
suming that the missing data are missing not at 
random corresponds to another model, producing 
exactly the same fit to the observed data, but now 
assuming that the missing data are missing at ran- 
dom. In the context of a conventional linear mixed 
model, Verbeke and Molenberghs (2010) illustrate 
the implications of the result by replacing the con- 
ditional distribution of the random effects given the 
data, that is, the random effects' posterior, by two 
families of exponential distributions, special cases of 
the gamma family for the sake of illustration. This 
nicely supplements the above compound-symmetry 
model with correlated random effects and measure- 
ment errors. 
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These results imply that one should not simply 
adopt a hierarchical model, only because it is con- 
venient, is in common use, etc. Rather, one should 
carefully reflect on that part of the model that can- 
not be critiqued by the data. One should strive for 

(1) better understanding of the dependence of one's 
inferences on nonverifiable model components and 

(2) developing sensitivity analysis tools regarding 
substantive conclusions with respect to data enrich- 
ment. Generally speaking, inferences relative to ob- 
served data only, such as fixed-effects and variance- 
component parameter estimates, are unaffected by 
the choice of enrichment model. However, such as- 
pects as empirical Bayes predictions in linear mixed 
models, or predictive distributions of unobserved mea- 
surements given observed ones, strongly rest on un- 
verifiable modeling assumptions. This points to the 
need for sensitivity analysis. Rather than fitting a 
single model and putting blind belief in it, it is more 
reasonable to consider a discrete or continuous set of 
alternative model formulations and assess how key 
inferences are vulnerable to choices made. Molen- 
berghs and Kenward (2007) discuss avenues for sen- 
sitivity analysis. 

As soon as one is aware of the lack of identifica- 
tion, there is reasonable latitude in making prag- 
matic identification choices. For example, with ran- 
dom effects or other latent structures, one could 
express a preference for conjugate priors (Lee and 
Nelder, 1996, 2001, 2003) because, in the absence of 
identification, the convenience and appeal of conju- 
gacy may be invoked. 

3. BAYARRI'S EXAMPLE AND A 
CAUCHY-TYPE FAMILY OF DISTRIBUTIONS 

Lee and Nelder, in their Section 4.2, revisit Ba- 
yarri's example. There is something rather peculiar 
about it, because it is of a Cauchy type. We will show 
this in what follows. Owing to the model's absence 
of finite moments, it seems natural that an estima- 
tion method ought to encounter problems. Indeed, 
any approach that does purport to provide estimates 
in such circumstances must raise concerns about its 
properties. 

Consider a Weibull model for repeated measures 
with gamma random effects: 



Here, i and j are as in Section 2, Oij are gamma 
random effects, Xj, are covariates, £ regression pa- 
rameters, p the Weibull shape parameters, and otj 
and /3j the parameters governing the gamma distri- 
bution for the jth component. 

Rather than the above two-parameter gamma den- 
sity, it is customary in a gamma frailty context 
(Duchateau and Janssen, 2007) to set otj/3j = 1, for 
reasons of identifiability. 

In line with Bayarri's example, we use the less 
conventional constraint ay = 1 and f3j = lead- 
ing to 

(14) 



(12) /(yi|0i 



e % 3 



(13) m) = n 



a, 



i=i 

and implying that the gamma density is reduced to 
an exponential one. Details can be found in Molen- 
berghs et al. (2009). 

The moments take the following form: 



(15) 



5 k J p k 



x exp 



Reducing the Weibull distribution to the exponen- 
tial one, that is, setting p = 1, we further find 

(16) E(Y*) = ^r(l - k)T(k)exp(-k^). 

The cases corresponding to (15) and, especially, (16) 
will allow us to make our point about Lee and Nelder's 
example. Generally, T{a — k/p) poses a problem when 
a — k/p is a negative integer. For simplicity focusing 
on a single outcome Y for the case where a = 1 and 
/3 = 1/5, assembling the linear predictor in p, and 
writing <p = Ae^ , we find 



(17) f(y) 



{8 + ipyPf 



(18) E(Y k ) = - ( 5 -\ lP ■ T(l - k/p) ■ T{k/p). 

pKvJ 

Note that (17) provides a family of distributions, 
special cases of the Weibull-gamma model that are 
termed Weibull-exponential by Molenberghs et al. 
(2009). Considering further the exponential case with 
p = 1, yields exponential-exponential distributions, 
with 

ip8 



(19) 



f(y) 



(5 + <py)< 
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(20) E{Y k ) = k[^j -Y{l-k)-T(k). 

Clearly, (19) defines a family of distributions with- 
out finite moments similar to the Cauchy distribu- 
tion because F(l — k) is undefined for k = 1,2, 

When p 7^ 1 but is fractional, some but not all mo- 
ments exist whereas for irrational values of p, all mo- 
ments in (18) are properly defined. Finally, observe 
that in the general case, there are combinations pos- 
sible for (a, p, k) that would lead to negative integers 
and hence undefined moments (15). 

In the light of the above developments, we are 
concerned that Lee and Nelder provide us with point 
estimates for moments that are undefined. 

4. THE NATURE OF GENERALIZED 
ESTIMATING EQUATIONS 

Lee and Nelder touch upon the use of general- 
ized estimating equations (GEE), as opposed to fully 
specified probability models. We agree that a com- 
parison between GEE to generalized linear mixed 
models (GLMM) or hierarchical generalized linear 
models (HGLM) is like a comparison of apples to or- 
anges because GEE is an estimating method which 
can be applied to random-effects models too (Zeger, 
Liang and Albert, 1988) and to HGLM for that 
matter. Therefore, again in agreement with Lee and 
Nelder, we would like to reiterate that a proper ba- 
sis of comparison is between marginal and random- 
effects models. Thus, while part of the literature is 
sloppy in making comparisons using sloppy catego- 
rizations, it should not deflect from the real issues. 
Nevertheless, we would like to reflect on whether 
GEE are indeed less appealing because of their al- 
leged lack of a probabilistic basis. 

There are two important points in our view. First, 
a fully specified probability model is not always es- 
sential when making inferences about a particular 
aspect of the model, such as mean functions, as long 
as the appropriate regularity conditions are satis- 
fied. For example, when estimating mean parame- 
ters, it is imperative that the mean exists and is 
finite. 

Second, as Molenberghs and Kenward (2010) point- 
ed out, the lower-order moments that need to be 
formulated when setting up GEE, correspond to at 
least one fully specified probabilistic model, even 
though it may not be of the simple, elegant type, one 
has in mind a priori, such as the Bahadur (1961) or 
odds-ratio model (Molenberghs and Lesaffre, 1994). 



Their work addresses the concern regarding whether 
the model portions used in GEE can always be viewed 
as a partially specified version of a model with full 
distributional assumptions, or rather, whether such 
a parent simply does not exist. To this end, they 
use the hybrid models (in the sense of being par- 
tially marginally and partially conditionally speci- 
fied) of Fitzmaurice and Laird (1993) and Molen- 
berghs and Ritter (1996). The results by Molen- 
berghs and Kenward (2010) are broadly valid. First, 
they are valid for a wide class of semi-parametric 
models where specification is done in terms of (parts 
of) the exponential-family formulation, including bi- 
nary, nominal, ordinal, and Poisson outcomes. Sec- 
ond, it is also valid when the outcome vector com- 
bines outcomes of different types. Third, using trans- 
formations, the result can be applied as well when 
the semi-parametric specification is not directly in 
terms of the exponential family, such as logistic re- 
gressions for binary data coupled with pairwise cor- 
relation, as in classical generalized estimating equa- 
tions. 

5. COMPUTATIONAL APPROACHES 

We are convinced that /i-likelihood is a tremen- 
dously appealing and important addition to the lit- 
erature. Other computational principles and tech- 
niques for maximizing a likelihood with unobserv- 
ables exist as well. Each one of them has its ad- 
vantages and drawbacks. For example, Taylor-series- 
based methods, such as PQL and MQL, and Laplace 
approximations often lead to substantial bias. This 
is important to realize because they have been in 
common use, nevertheless, not in the least because 
they are implemented in standard statistical soft- 
ware such as the SAS procedure GLIMMIX. 
The numerical-integration-based methodology, 
implemented for example in the SAS procedure 
NLMIXED, is frequently slow and/or extremely sen- 
sitive to starting values. See also Molenberghs and 
Verbeke (2005). Also Bayesian methods, sometimes 
believed to be free of the issues arising in a likeli- 
hood or frequentist context, also have their prob- 
lems. For one, the sensitivity arising from unobserv- 
ables, as discussed in Section 2, is equally present 
in this framework. It is clear to us that none of the 
computational approaches will be able to claim uni- 
form superiority over all others. 

Molenberghs et al. (2009) provide an overview of 
computational methods, including some less familiar 
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ones. Their context is a hierarchical model with both 
normally distributed and conjugate or other random 
effects. Each of them deals in its own way with the 
lack of closed-form expression for the marginal likeli- 
hood, even though Molenberghs et al. (2009) derive 
such closed forms for more settings such as general 
Poisson, probit and Weibull models with random ef- 
fects. 

One approach, very specific to the setting of Molen- 
berghs et al. (2009), is to integrate analytically over 
conjugate random effects and then further numeri- 
cally over the normally distributed random effects. 

For the specific case of the marginalized probit 
model, the computational challenge stems from the 
presence of a high-dimensional multivariate normal 
integral in the marginal distribution. Zeger, Liang 
and Albert (1988) derived the marginal mean func- 
tion, needed for their application of generalized es- 
timating equations fitting algorithm for the 
marginalized probit model. It is one of the first in- 
stances of the use of GEE to a nonmarginally spec- 
ified model. 

In the same spirit, pseudo-likelihood can be used 
(Aerts et al., 2002; Molenberghs and Verbeke, 2005). 
This is particularly useful when the joint marginal 
distribution is available but cumbersome to manip- 
ulate and evaluate, such as in the probit case. This 
is the idea followed by Renard, Molenberghs, and 
Geys (2004) for a multilevel probit model with ran- 
dom effects. 

Schall (1991) proposed an efficient and general es- 
timation algorithm, based on Harville's (1974) mod- 
ification of Henderson's (1984) mixed- model equa- 
tions. Hedeker and Gibbons (1994) and Gibbons 
and Hedeker (1997) proposed numerical-integration 
based methods, thus considering neither marginal 
moments (means, variances) nor marginalized joint 
probabilities. Guilkey and Murphy (1993) provide 
a useful early overview of estimation methods and 
then revert to Butler and Moffit's (1982) Hermite- 
integration based method, supplemented with Monte 
Carlo Markov chain ideas. Also the EM algorithm 
can be used, in line with Booth et al. (2003) for the 
Poisson case. The EM is a flexible framework within 
which random effects can be considered the "miss- 
ing" data over which expectations are taken. Booth 
et al. (2003) also considered nonparametric maxi- 
mum likelihood, in the spirit of Aitken (1999) and 
Alfo and Aitkin (2000). 

A suite of methods is available that employ trans- 
formation results, essentially based on transform- 
ing the nonnormal random effects to normal ones, 



or vice versa. Liu and Yu (2008) propose a sim- 
ple transformation of a nonnormal random effect 
to a normal one, at density level, upon which the 
SAS procedure NLMIXED or similar software can 
be used. Nelson et al. (2006) advocate the transfor- 
mation, Ui = F~ 1 \{^{ai)\ where F u is the cumula- 
tive distribution function (CDF) of U{, and <!>(•) is 
the standard normal CDF, as before. The method 
of Nelson et al., labeled probability integral transfor- 
mation (P.I.T.), comes down to generating normal 
variates and then inserting these in the model only 
after transformation, ensuring that they are of the 
desired nature. Lin and Lee (2008) present estima- 
tion methods for the specific case of linear mixed 
models with skew-normal, rather than normal, ran- 
dom effects. 

Quite apart from the choice of estimation method, 
it is important to realize that not all parameters 
may be simultaneously identifiable. For example, the 
gamma-distribution parameters in the Poisson case, 
a and f3, such as in (13), are not simultaneously 
identifiable when the linear-predictor part is also 
present because there is aliasing with the intercept 
term. Therefore, one can set, for example, j3 equal to 
a constant, removing the identifiability problem. It 
is then clear that ce, in the univariate case, or the set 
of ctj in the repeated-measures case, describes the 
additional overdispersion in addition to what stems 
from the normal random effect (s). A similar phe- 
nomenon also plays in the binary case, where both 
beta-distribution parameters are not simultaneously 
estimable. 

6. CONCLUDING REMARKS 

We end by sincerely thanking Professors Lee and 
Nelder for their important, thought-provoking, and 
practically relevant work in this rapidly evolving 
area. Their paper has allowed us to elaborate on 
a number of statistical issues put forward in their 
paper, such as the implications of formulating mod- 
els with unobservables and generating distributions 
with peculiar moment-properties. It further gave us 
the opportunity to reflect on some conceptual as- 
pects of generalized estimating equations on one hand, 
and elaborate on computational strategies for mod- 
els of the type discussed here on the other. 
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